Compression of Strings with Approximate Repeats

Authors

  • Lloyd Allison
  • Timothy Edgoose
  • Trevor I. Dix
Abstract

We describe a model for strings of characters that is loosely based on the Lempel-Ziv model, with the addition that a repeated substring can be an approximate match to the original substring; this is close to the situation in DNA, for example. Typically there are many explanations for a given string under the model, some optimal and many suboptimal. Rather than commit to one optimal explanation, we sum the probabilities over all explanations under the model, because this sum is the probability of the data under the model. The model has a small number of parameters, and these can be estimated from the given string by an expectation-maximization (EM) algorithm. Each iteration of the EM algorithm takes O(n^2) time, and a few iterations are typically sufficient. O(n^2) complexity is impractical for strings of more than a few tens of thousands of characters, so a faster approximation algorithm is also given. The model is further extended to include approximate reverse complementary repeats when analyzing DNA strings. Tests include the recovery of parameter estimates from known sources and applications to real DNA strings.
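The point about summing over explanations can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names and the probabilities in the usage note are hypothetical, chosen only to show that the summed probability gives a code length (in bits) no longer than that of the single best explanation.

```python
import math


def summed_length_bits(explanation_probs):
    """Code length in bits when probabilities of all explanations are summed.

    `explanation_probs` is an assumed list of joint probabilities
    P(string, explanation), one per explanation of the same string.
    """
    total = math.fsum(explanation_probs)
    return -math.log2(total)


def best_length_bits(explanation_probs):
    """Code length in bits when only the most probable explanation is used."""
    return -math.log2(max(explanation_probs))


def saving_bits(explanation_probs):
    """Bits saved by summing rather than committing to one explanation."""
    return best_length_bits(explanation_probs) - summed_length_bits(explanation_probs)
```

For example, with three hypothetical explanations of probability 0.25, 0.125 and 0.125, the best single explanation costs 2 bits while the sum over all three costs 1 bit, so `saving_bits([0.25, 0.125, 0.125])` is 1.0.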

Similar resources

Detection of Significant Patterns by Compression Algorithms: the Case of Approximate Tandem Repeats in DNA Sequences. Rivals

We use compression algorithms to analyse genetic sequences. The basic idea is that a compression algorithm is associated with a property. The more a sequence is compressed by the algorithm, the more significant is the property for that sequence. Here we present an algorithm to detect a particular type of dosDNA (Defined Ordered Sequence-DN...

DNA Data Compression Algorithms Based on Redundancy

Carl Jung spoke of the 'collective unconscious', i.e. we are all connected to each other in some way or other via our DNA. DNA usually contains four bases: a (Adenine), c (Cytosine), g (Guanine) and t (Thymine). Each of these bases can be represented by two bits, since 2^2 = 4, i.e. a – 00, c – 01, g – 11 and t – 10 respectively, although this assignment is arbitrary. So redundan...
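The two-bit representation described in this snippet can be sketched in Python as follows. The bit assignment is the one quoted above; the function names `pack` and `unpack` are hypothetical, and this is an illustration of the encoding only, not the compression method of the cited paper.

```python
# Two-bit codes as quoted in the snippet; the assignment itself is arbitrary.
CODE = {"a": 0b00, "c": 0b01, "g": 0b11, "t": 0b10}
DECODE = {bits: base for base, bits in CODE.items()}


def pack(seq):
    """Pack a DNA string into one integer, two bits per base.

    Returns (value, length); the length is needed because leading
    'a' bases (code 00) would otherwise be lost.
    """
    value = 0
    for base in seq.lower():
        value = (value << 2) | CODE[base]
    return value, len(seq)


def unpack(value, length):
    """Recover the DNA string from a packed integer and its base count."""
    bases = []
    for i in range(length):
        shift = 2 * (length - 1 - i)  # most significant base first
        bases.append(DECODE[(value >> shift) & 0b11])
    return "".join(bases)
```

For example, `pack("acgt")` yields the bit pattern 00 01 11 10, and `unpack` applied to that result returns `"acgt"`, so four bases fit in one byte instead of four.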

A First Step Toward Chromosome Analysis

In this paper, we use Kolmogorov complexity and compression algorithms to study DOS-DNA (DOS: defined ordered sequence). This approach gives quantitative and qualitative explanations of the regularities of apparently regular regions. We present the problem of the coding of approximate multiple tandem repeats in order to obtain compression. Then we describe an algorithm that allows to find efficientl...

Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences

MOTIVATION: Compression algorithms can be used to analyse genetic sequences. A compression algorithm tests a given property on the sequence and uses it to encode the sequence: if the property is true, it reveals some structure of the sequence which can be described briefly; this yields a description of the sequence which is shorter than the sequence of nucleotides given in extenso. The more a se...

Repeats and Palindromes: an Overview

With a long text string like DNA, repeats and palindromes are not easily spotted. Yet finding such substrings is important; for instance, repeats in DNA are indicators of certain hereditary disorders and are used as genetic markers. We discuss repeats and then palindromes and then we relate the two. In our discussion of repeats, we first define an exact repeat and then five definitions of approximate r...

Journal:
  • Proceedings. International Conference on Intelligent Systems for Molecular Biology

Volume 6

Publication date: 1998